The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. This is a historic moment, scientifically and analytically as well: a large amount of data is being shared across the globe and analyzed in real time, and some interesting questions may fall out of it. The world has come to a halt because of the pandemic.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are the future case predictions, based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS

* This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but this was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: clicking on the U.S. stats did not bring up the U.S. numbers. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots that I can compare my own to). An interesting aside: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in measures like total cases (more testing, for example).

SOURCE RAW DATA

* https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: GitHub (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: GitHub (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Tue May 26 20:35:10 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for the local script

python3 get_google_mobility.py
# writes the mobility data to a CSV file, "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Tue May 26 20:35:10 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------
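
The path variables above rely on a trailing "/" and on the results directories already existing. A minimal sketch of one way to relax both assumptions, using base R only. The names project_root, results_dir_example and results_dir_custom_example are hypothetical, and a temporary directory stands in for the real working_dir:

```r
# Hypothetical project root for illustration; the analysis itself uses working_dir
project_root <- file.path(tempdir(), "coronavirus")

# file.path() inserts the separators, so no trailing "/" is needed
results_dir_example <- file.path(project_root, "results")
results_dir_custom_example <- file.path(results_dir_example, "custom")

# Create the nested directories if they are missing; existing ones are left alone
for (d in c(results_dir_example, results_dir_custom_example)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

dir.exists(results_dir_custom_example)
```

Running the loop a second time is harmless, which makes it safe to keep at the top of a script that is re-run daily.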

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the intersection and set differences of the column names of two data frames
find_linear_index  finds the first date at which linearity occurs
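
The prediction_model entry above takes a slope and an intercept from a line fit to log10(cases) versus days. Because that line lives on the log10 scale, a case count is recovered by raising 10 to the fitted value. A small sketch of that arithmetic follows. The helper name predict_cases and the parameter values are assumptions chosen for illustration, not values from this analysis:

```r
# Hypothetical helper: back-transform a log10-linear fit to case counts
predict_cases <- function(m, b, days) {
  log10_cases <- m * days + b  # the fitted line is on the log10 scale
  data.frame(days = days, predicted_cases = 10^log10_cases)
}

# Assumed example parameters: 100 cases on day 0 and tenfold growth every 10 days
example <- predict_cases(m = 0.1, b = 2, days = c(0, 10, 20))
example
```

With these assumed parameters the slope of 0.1 log10 units per day means a tenfold increase every 10 days, so the three rows come out at roughly 100, 1,000 and 10,000 cases.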
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10(cases) linear model parameters and a set of days since 100 cases, and outputs a data frame with the total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b # the linear model predicts log10(cases)
  total_cases<-10^total_cases.log # back-transform to a case count
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/values. Also enforces standardization (assumes the data structure and naming) by using a fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = TRUE)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the first stability point (every value in the window within 0.2)
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
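
The idea behind find_linear_index above (look for a stretch where the second difference of the log-transformed counts stays near zero) can also be written without explicit loops. The following is a sketch of that idea, not the function used in this analysis: find_linear_start, window and tol are hypothetical names, it returns a row index rather than a date, and it assumes all counts are positive.

```r
# Hypothetical vectorized variant; names and return value are illustrative
find_linear_start <- function(cases, window = 5, tol = 0.2) {
  log_cases <- log2(cases)
  d1 <- diff(log_cases)  # first difference: daily change in log2(cases)
  d2 <- diff(d1)         # second difference: change in that daily slope
  stable <- abs(d2) < tol
  # Number of stable points in each trailing window of length `window`
  run <- as.numeric(stats::filter(as.numeric(stable), rep(1, window), sides = 1))
  hit <- which(run == window)
  if (length(hit) == 0) return(NA_integer_)
  # Index of the first observation in the first fully stable window
  hit[1] - window + 1
}

# Toy series: erratic at first, then steady doubling from the 5th value on
toy_cases <- c(1, 3, 4, 20, 25, 50, 100, 200, 400, 800, 1600, 3200)
find_linear_start(toy_cases)
```

On the toy series the first fully stable window starts at the fifth observation (25 cases), which is where the steady doubling begins.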

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X5.25.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X5.25.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X5.25.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X5.25.20"
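
On the question above about archiving previous versions: one option, sketched here under assumptions, is to copy each existing file to a date-stamped name before download.file() overwrites it. The helper archive_previous and the "archive" subdirectory are hypothetical and are not part of the repository; the demonstration uses a temporary file instead of the real data directory.

```r
# Hypothetical helper: copy an existing file to a date-stamped name before it is overwritten
archive_previous <- function(path, archive_dir = file.path(dirname(path), "archive")) {
  if (!file.exists(path)) return(invisible(NA_character_))
  dir.create(archive_dir, recursive = TRUE, showWarnings = FALSE)
  stamp <- format(file.info(path)$mtime, "%Y%m%d")  # date the file was last modified
  archived <- file.path(archive_dir, paste0(stamp, "_", basename(path)))
  file.copy(path, archived, overwrite = TRUE)
  invisible(archived)
}

# Demonstration on a temporary file
demo_file <- file.path(tempdir(), "time_series_demo.csv")
writeLines("a,b\n1,2", demo_file)
archived_copy <- archive_previous(demo_file)
file.exists(archived_copy)
```

In the chunk above this would amount to calling the helper on a path such as Corona_Cases.fn immediately before the matching download.file() call.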

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
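
The three steps listed above can be sketched on a toy table using base R only. The analysis itself uses reshape2::melt through make_long; the data frame below is made up, with column names such as X1.22.20 mimicking how read.csv renders the JHU date headers.

```r
# Toy wide table: one row per region, one column per date (values are invented)
wide <- data.frame(
  Country.Region = c("A", "B"),
  X1.22.20 = c(1, 0),
  X1.23.20 = c(10, 2),
  stringsAsFactors = FALSE
)

# Step 1: wide to long, one row per region and date
date_cols <- grep("^X", names(wide), value = TRUE)
long <- data.frame(
  Country.Region = rep(wide$Country.Region, times = length(date_cols)),
  Date_label = rep(date_cols, each = nrow(wide)),
  Total_confirmed_cases = unlist(wide[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Step 2: drop the "X" prefix and parse month.day.two-digit-year
long$Date <- as.Date(sub("^X", "", long$Date_label), format = "%m.%d.%y")
long$Date.numeric <- as.numeric(long$Date)

# Step 3: log10 transform; a count of zero becomes -Inf
long$Total_confirmed_cases.log <- log10(long$Total_confirmed_cases)
long
```

The -Inf produced for zero counts is worth keeping in mind when fitting or plotting on the log scale, since those rows are typically dropped or trigger warnings.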
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other data. When combining, all datasets get this column, but US deaths is the only one with useful values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU's file format is awkward: it is in wide format, with one column per date (MM/DD/YY). So read it in as raw data, but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "") # read.csv prefixes "X" to column names that start with a digit
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year is written as 20, not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%Y") # %Y matches the four-digit year
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 125 125
Albania 125 125
Algeria 125 125
Andorra 125 125
Angola 125 125
Antigua and Barbuda 125 125
Argentina 125 125
Armenia 125 125
Australia 1000 1000
Austria 125 125
Azerbaijan 125 125
Bahamas 125 125
Bahrain 125 125
Bangladesh 125 125
Barbados 125 125
Belarus 125 125
Belgium 125 125
Belize 125 125
Benin 125 125
Bhutan 125 125
Bolivia 125 125
Bosnia and Herzegovina 125 125
Botswana 125 125
Brazil 125 125
Brunei 125 125
Bulgaria 125 125
Burkina Faso 125 125
Burma 125 125
Burundi 125 125
Cabo Verde 125 125
Cambodia 125 125
Cameroon 125 125
Canada 1750 1750
Central African Republic 125 125
Chad 125 125
Chile 125 125
China 4125 4125
Colombia 125 125
Comoros 125 125
Congo (Brazzaville) 125 125
Congo (Kinshasa) 125 125
Costa Rica 125 125
Cote d’Ivoire 125 125
Croatia 125 125
Cuba 125 125
Cyprus 125 125
Czechia 125 125
Denmark 375 375
Diamond Princess 125 125
Djibouti 125 125
Dominica 125 125
Dominican Republic 125 125
Ecuador 125 125
Egypt 125 125
El Salvador 125 125
Equatorial Guinea 125 125
Eritrea 125 125
Estonia 125 125
Eswatini 125 125
Ethiopia 125 125
Fiji 125 125
Finland 125 125
France 1375 1375
Gabon 125 125
Gambia 125 125
Georgia 125 125
Germany 125 125
Ghana 125 125
Greece 125 125
Grenada 125 125
Guatemala 125 125
Guinea 125 125
Guinea-Bissau 125 125
Guyana 125 125
Haiti 125 125
Holy See 125 125
Honduras 125 125
Hungary 125 125
Iceland 125 125
India 125 125
Indonesia 125 125
Iran 125 125
Iraq 125 125
Ireland 125 125
Israel 125 125
Italy 125 125
Jamaica 125 125
Japan 125 125
Jordan 125 125
Kazakhstan 125 125
Kenya 125 125
Korea, South 125 125
Kosovo 125 125
Kuwait 125 125
Kyrgyzstan 125 125
Laos 125 125
Latvia 125 125
Lebanon 125 125
Lesotho 125 125
Liberia 125 125
Libya 125 125
Liechtenstein 125 125
Lithuania 125 125
Luxembourg 125 125
Madagascar 125 125
Malawi 125 125
Malaysia 125 125
Maldives 125 125
Mali 125 125
Malta 125 125
Mauritania 125 125
Mauritius 125 125
Mexico 125 125
Moldova 125 125
Monaco 125 125
Mongolia 125 125
Montenegro 125 125
Morocco 125 125
Mozambique 125 125
MS Zaandam 125 125
Namibia 125 125
Nepal 125 125
Netherlands 625 625
New Zealand 125 125
Nicaragua 125 125
Niger 125 125
Nigeria 125 125
North Macedonia 125 125
Norway 125 125
Oman 125 125
Pakistan 125 125
Panama 125 125
Papua New Guinea 125 125
Paraguay 125 125
Peru 125 125
Philippines 125 125
Poland 125 125
Portugal 125 125
Qatar 125 125
Romania 125 125
Russia 125 125
Rwanda 125 125
Saint Kitts and Nevis 125 125
Saint Lucia 125 125
Saint Vincent and the Grenadines 125 125
San Marino 125 125
Sao Tome and Principe 125 125
Saudi Arabia 125 125
Senegal 125 125
Serbia 125 125
Seychelles 125 125
Sierra Leone 125 125
Singapore 125 125
Slovakia 125 125
Slovenia 125 125
Somalia 125 125
South Africa 125 125
South Sudan 125 125
Spain 125 125
Sri Lanka 125 125
Sudan 125 125
Suriname 125 125
Sweden 125 125
Switzerland 125 125
Syria 125 125
Taiwan* 125 125
Tajikistan 125 125
Tanzania 125 125
Thailand 125 125
Timor-Leste 125 125
Togo 125 125
Trinidad and Tobago 125 125
Tunisia 125 125
Turkey 125 125
Uganda 125 125
Ukraine 125 125
United Arab Emirates 125 125
United Kingdom 1375 1375
Uruguay 125 125
US 125 125
US_state 407625 407625
Uzbekistan 125 125
Venezuela 125 125
Vietnam 125 125
West Bank and Gaza 125 125
Western Sahara 125 125
Yemen 125 125
Zambia 125 125
Zimbabwe 125 125
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are reported by province rather than summarized at the country level. Collapse all but the US state data.
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))
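
The collapse step above sums province-level rows so that each country has one value per date. A base R sketch of the same idea on invented numbers follows; the analysis itself uses plyr::ddply, and the data frame here is purely illustrative.

```r
# Toy data: country X is reported as two provinces, country Y as a single row
province_level <- data.frame(
  Country.Region = c("X", "X", "Y", "X", "X", "Y"),
  Province.State = c("p1", "p2", "", "p1", "p2", ""),
  Date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-01",
                   "2020-03-02", "2020-03-02", "2020-03-02")),
  Total_confirmed_cases = c(5, 7, 3, 8, 9, 4),
  stringsAsFactors = FALSE
)

# Sum the provinces within each country and date
country_level <- aggregate(
  Total_confirmed_cases ~ Country.Region + Date,
  data = province_level,
  FUN = sum
)
country_level
```

Because the counts are cumulative totals, adding the provinces for a given date yields the cumulative total for the country on that date.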

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day on which the 100th case was found for each Country/Province. NOTE: Non-US countries may have unusual provinces. For example, France is summarized at the country level but also had 3 provinces. I've only ensured that the U.S. case100 works, so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
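
The days-since-100 calculation above can be illustrated on a toy table in base R. The numbers are invented, and the sketch uses the same ">100" rule as the merge above:

```r
# Toy cumulative counts for two regions over four days
toy <- data.frame(
  Country.Region = rep(c("A", "B"), each = 4),
  Date.numeric = rep(1:4, times = 2),
  Total_confirmed_cases = c(50, 120, 300, 700, 10, 90, 101, 250),
  stringsAsFactors = FALSE
)

# First day on which each region exceeds 100 cases
over_100 <- toy[toy$Total_confirmed_cases > 100, ]
first_day <- tapply(over_100$Date.numeric, over_100$Country.Region, min)

# Look up each row's region and subtract; negative values are days before the threshold
toy$Days_since_100 <- toy$Date.numeric - as.numeric(first_day[toy$Country.Region])
toy
```

One difference from the merge-based version is worth noting: merge() keeps only regions that appear in both inputs by default, so a region that never exceeds 100 cases drops out of Corona_Cases, whereas the lookup in this sketch would leave it in place with NA.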

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population
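
A note on the coercion warning and the empty NA table above: a Population column filled with the string "NA" is text, not a missing value, so is.na() does not flag it until it has been converted with as.numeric(). The sketch below shows that behaviour and the per-100 calculation on invented values:

```r
# Population stored as text, with the literal string "NA" standing in for missing values
population_text <- c("1000", "NA", "250000")

is.na(population_text)  # the string "NA" is not a missing value
population <- suppressWarnings(as.numeric(population_text))
is.na(population)       # after coercion the middle entry is a true NA

# Cases per 100 residents; rows without a population stay NA
cases <- c(10, 5, 500)
cases_per100 <- 100 * cases / population
cases_per100
```

Assigning NA_real_ instead of the string "NA" when the column is first created would be one way to avoid the warning, at the cost of those rows being removed by the is.na() filter.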


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 ) 
kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 4199
Alaska 777
Arizona 1061
Arkansas 4380
California 4047
Colorado 3827
Connecticut 628
Delaware 259
Diamond Princess 70
District of Columbia 71
Florida 4498
Georgia 10010
Grand Princess 71
Guam 71
Hawaii 367
Idaho 1991
Illinois 5706
Indiana 5825
Iowa 5283
Kansas 4351
Kentucky 6355
Louisiana 4215
Maine 1068
Maryland 1649
Massachusetts 1071
Michigan 5081
Minnesota 4758
Mississippi 5224
Missouri 5740
Montana 1790
Nebraska 3256
Nevada 793
New Hampshire 741
New Jersey 1597
New Mexico 1748
New York 4033
North Carolina 6136
North Dakota 2050
Northern Mariana Islands 56
Ohio 5477
Oklahoma 4122
Oregon 2108
Pennsylvania 4333
Puerto Rico 71
Rhode Island 418
South Carolina 3053
South Dakota 2609
Tennessee 5913
Texas 12317
Utah 980
Vermont 990
Virgin Islands 71
Virginia 7722
Washington 2761
West Virginia 2800
Wisconsin 4130
Wyoming 1275
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
  * comparative longitudinal case trend (absolute & log scale)
  * comparative longitudinal mortality trend
  * death vs total correlation

question | dataset | x | y | color | facet | pch | dimensions
comparative_longitudinal_case_trend | long | time | log_cases | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend | long | time | cases | geography | case_type | ? | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend | wide | time | mortality rate | geography | none | none | [15, 50, 4] geography
death vs total correlation | wide | cases | deaths | geography | none | none | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in case/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in case/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm to each region starting from the 100th case. m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID
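
The per-region fit outlined in the notes above (one lm per region, returning m and b) could look like the following sketch. The toy data, the column name log10_cases and the params table are assumptions for illustration; nothing here is fit to the real data.

```r
# Toy data: two regions with exact log10-linear growth after their 100th case
days <- 0:9
toy <- rbind(
  data.frame(region = "A", Days_since_100 = days, log10_cases = 2 + 0.10 * days,
             stringsAsFactors = FALSE),
  data.frame(region = "B", Days_since_100 = days, log10_cases = 2 + 0.05 * days,
             stringsAsFactors = FALSE)
)

# Fit one linear model per region and keep its coefficients
fits <- lapply(split(toy, toy$region), function(d) {
  coef(lm(log10_cases ~ Days_since_100, data = d))
})

# Collect intercept (b) and slope (m) into one row per region
params <- data.frame(
  region = names(fits),
  b = vapply(fits, function(co) co[["(Intercept)"]], numeric(1)),
  m = vapply(fits, function(co) co[["Days_since_100"]], numeric(1)),
  row.names = NULL
)
params
```

Each row of params then holds the slope and intercept that a function like prediction_model expects for that region.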



#total/death on same plot - differ by 2 logs, so when plotting log, use pch. when plotting absolute, need to use free scales
#when plotting death and case on same, melt. 

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 37572
China 125
Diamond Princess 106
Korea, South 96
Japan 95
Italy 93
Iran 90
Singapore 87
France 86
Germany 86
Spain 85
US 84
Switzerland 82
United Kingdom 82
Belgium 81
Netherlands 81
Norway 81
Sweden 81
Austria 79
Malaysia 78
Australia 77
Bahrain 77
Denmark 77
Canada 76
Qatar 76
Iceland 75
Brazil 74
Czechia 74
Finland 74
Greece 74
Iraq 74
Israel 74
Portugal 74
Slovenia 74
Egypt 73
Estonia 73
India 73
Ireland 73
Kuwait 73
Philippines 73
Poland 73
Romania 73
Saudi Arabia 73
Indonesia 72
Lebanon 72
San Marino 72
Thailand 72
Chile 71
Pakistan 71
Luxembourg 70
Peru 70
Russia 70
Ecuador 69
Mexico 69
Slovakia 69
South Africa 69
United Arab Emirates 69
Armenia 68
Colombia 68
Croatia 68
Panama 68
Serbia 68
Taiwan* 68
Turkey 68
Argentina 67
Bulgaria 67
Latvia 67
Uruguay 67
Algeria 66
Costa Rica 66
Dominican Republic 66
Hungary 66
Andorra 65
Bosnia and Herzegovina 65
Jordan 65
Lithuania 65
Morocco 65
New Zealand 65
North Macedonia 65
Vietnam 65
Albania 64
Cyprus 64
Malta 64
Moldova 64
Brunei 63
Burkina Faso 63
Sri Lanka 63
Tunisia 63
Ukraine 62
Azerbaijan 61
Ghana 61
Kazakhstan 61
Oman 61
Senegal 61
Venezuela 61
Afghanistan 60
Cote d’Ivoire 60
Cuba 59
Mauritius 59
Uzbekistan 59
Cambodia 58
Cameroon 58
Honduras 58
Nigeria 58
West Bank and Gaza 58
Belarus 57
Georgia 57
Bolivia 56
Kosovo 56
Kyrgyzstan 56
Montenegro 56
Congo (Kinshasa) 55
Kenya 54
Niger 53
Guinea 52
Rwanda 52
Trinidad and Tobago 52
Paraguay 51
Bangladesh 50
Djibouti 48
El Salvador 47
Guatemala 46
Madagascar 45
Mali 44
Congo (Brazzaville) 41
Jamaica 41
Gabon 39
Somalia 39
Tanzania 39
Ethiopia 38
Burma 37
Sudan 36
Liberia 35
Maldives 33
Equatorial Guinea 32
Cabo Verde 30
Sierra Leone 28
Guinea-Bissau 27
Togo 27
Zambia 26
Eswatini 25
Chad 24
Tajikistan 23
Haiti 21
Sao Tome and Principe 21
Benin 19
Nepal 19
Uganda 19
Central African Republic 18
South Sudan 18
Guyana 16
Mozambique 15
Yemen 11
Mongolia 10
Mauritania 7
Nicaragua 7
Malawi 1
Syria 1
# Pick the top max_colors (12) countries with the most data
max_colors<-12
# find a way to fix this: China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 125
Korea, South 96
Japan 95
Italy 93
Iran 90
Singapore 87
France 86
Germany 86
Spain 85
US 84
Switzerland 82
United Kingdom 82
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May"))

measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3")

##/////////////////////////
### Plot log10 total and death cases over time

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs. date becomes linear in each plot
  2. Fit a linear model for each state
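The two steps above can be sketched on a toy series. This is only an illustration under stated assumptions: `find_linear_start_sketch` and its `tol` threshold are hypothetical names invented here, not the report's `find_linear_index`, whose definition is not shown in this section. The idea is that a series is roughly linear once its second difference stays close to zero.

```r
# Hypothetical helper (NOT the report's find_linear_index): return the first
# index of y from which the second difference stays near zero to the end.
find_linear_start_sketch <- function(y, tol = 0.01) {
  d1 <- diff(y)   # first difference: change per step
  d2 <- diff(d1)  # second difference: change in the change
  # Assumption: scale by the median step so tol acts as a relative threshold
  rel <- abs(d2) / max(abs(median(d1)), .Machine$double.eps)
  flat <- rel <= tol
  # Walk backwards to find where the trailing run of "flat" values begins
  run_start <- length(flat) + 1
  for (i in rev(seq_along(flat))) {
    if (!flat[i]) break
    run_start <- i
  }
  if (run_start > length(flat)) return(NA_integer_)  # no trailing linear run
  run_start
}

# Toy series: quadratic growth for 10 steps, then a constant slope of 20
y <- c((1:10)^2, 100 + 20 * (1:10))
start <- find_linear_start_sketch(y)

# Step 2: fit a straight line only to the part judged linear
x <- seq(start, length(y))
fit <- lm(y[x] ~ x)
coef(fit)
```

Real case counts are noisy, so a tolerance this strict would likely need loosening or smoothing first; the value here is chosen only to suit the toy data.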

# What is the predicted number of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the top states: drop New York and add Maryland in its place
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state (selected top states) with linear fit")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Start from a zero-row frame so no all-NA placeholder row ends up in the table
predicted_days_df<-data.frame(matrix(ncol=3,nrow=0))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)
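As a rough illustration of the prediction step above: the loop passes each state's slope `m` and intercept `b` to `prediction_model()`, whose definition is not shown in this section. The sketch below assumes it simply evaluates the fitted line at the requested numeric dates. `predict_linear_sketch` and the parameter values are hypothetical.

```r
# Hypothetical stand-in for the report's prediction_model(); its real
# signature and return columns are not shown in this section.
predict_linear_sketch <- function(m, b, days) {
  data.frame(days = days, predicted_cases = m * days + b)
}

# Made-up slope and intercept, chosen only so the arithmetic is easy to follow.
# Dates are handled as days since 1970-01-01, as in the code above.
day0 <- as.numeric(as.Date("2020-04-20"))
m <- 1000               # cases added per day
b <- 50000 - m * day0   # line passes through 50,000 cases on 2020-04-20

pred <- predict_linear_sketch(m, b, day0 + c(0, 1, 2, 3, 7))
pred$Date <- as.Date(pred$days, origin = "1970-01-01")
pred
```

Because the x-axis is a date stored as days since 1970-01-01, the intercept is the fitted value at that origin. This is why `b` looks very large and negative even for a sensible fit.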

Q2: What is the predicted number of cases?

What does the model predict for COVID-19 cases so far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases is reached? When will the plateau happen? Are any effects of social distancing noticeable, and with what delay?
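One of these questions, how long until x cases, has a closed form while growth stays log-linear. Assuming a fit of the form log10(cases) = m * day + b, the day on which a target is reached is (log10(x) - b) / m, and the doubling time is log10(2) / m. The sketch below uses illustrative parameters, not values fitted to this report's data, and the function names are hypothetical.

```r
# Assumes a log-linear fit of the form log10(cases) = m * day + b
days_until_cases <- function(target_cases, m, b) {
  (log10(target_cases) - b) / m
}

# Days needed for the case count to double under the same model
doubling_time <- function(m) log10(2) / m

# Illustrative parameters only (not fitted to the report's data)
m <- 0.1  # log10 cases gained per day
b <- 2    # log10(100), i.e. 100 cases on day 0

days_until_cases(1e5, m, b)  # (5 - 2) / 0.1
doubling_time(m)
```

These estimates only hold while the trend stays log-linear; once cases start to trail off, they will understate the time needed.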

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

histoical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

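# Fit one linear model per group; return a table of intercepts (b) and slopes (m)
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # coef() returns "(Intercept)" then the predictor's coefficient (single-predictor formula assumed)
  names(case_models.parameters)[2:3]<-c("b","m")
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)

# Incomplete scratch call, disabled: input_data and subset_coulmn are not defined at top level
# dlply(input_data, subset_coulmn, lm,m=)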
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/
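The scraped values arrive as percent strings, and county names need to match the case data before merging. A minimal sketch of that clean-up follows, using made-up rows in the shape the mobility CSV is assumed to have. The column names follow those used later in this section; `percent_to_numeric` is a hypothetical helper.

```r
# Made-up rows in the shape the mobility CSV is assumed to have
mobility_example <- data.frame(
  County = c("Bucks County", "Baltimore City", "Cape May County"),
  Workplace = c("-36%", "-35%", NA),
  Residential = c("+15%", "+14%", "+18%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip the percent sign and convert to numeric
percent_to_numeric <- function(x) as.numeric(gsub("%", "", x, fixed = TRUE))

mobility_example$Workplace <- percent_to_numeric(mobility_example$Workplace)
mobility_example$Residential <- percent_to_numeric(mobility_example$Residential)

# Drop the " County" suffix so names line up with the City column of the case data
mobility_example$County <- sub(" County$", "", mobility_example$County)
mobility_example
```

Missing values stay `NA` after conversion, so they can be filtered out before plotting or computing correlations.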

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (Today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit -1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9931972 -56.0
Hawaii Grocery_Pharmacy 0.9695437 -34.0
New Hampshire Parks 0.9584602 -20.0
Connecticut Grocery_Pharmacy -0.9081367 -6.0
Maine Transit -0.9030169 -50.0
Alaska Residential 0.8898278 13.0
South Dakota Parks 0.8669580 -26.0
Utah Residential -0.8614252 12.0
Vermont Parks 0.8465671 -35.5
Alaska Grocery_Pharmacy -0.8078182 -7.0
Hawaii Residential -0.7854909 19.0
Utah Transit -0.7836851 -18.0
Massachusetts Workplace -0.7645647 -39.0
Connecticut Transit -0.7586359 -50.0
Rhode Island Workplace -0.7503039 -39.5
Wyoming Parks -0.7301529 -4.0
Alaska Workplace -0.7296146 -34.0
Wyoming Transit -0.7239989 -17.0
Utah Parks -0.6901361 17.0
Hawaii Parks 0.6813458 -72.0
Vermont Grocery_Pharmacy -0.6546225 -25.0
Utah Workplace -0.6492643 -37.0
New York Workplace -0.6466677 -34.5
Maine Workplace -0.6452245 -30.0
Arizona Grocery_Pharmacy -0.6385809 -15.0
Rhode Island Retail_Recreation -0.6273853 -45.0
Montana Workplace -0.6239388 -40.5
Hawaii Transit 0.6188732 -89.0
Rhode Island Residential -0.6164663 18.5
New Jersey Workplace -0.6061341 -44.0
Nebraska Workplace 0.6061324 -32.5
New Jersey Parks -0.5961811 -6.0
New York Retail_Recreation -0.5870289 -46.0
North Dakota Retail_Recreation -0.5409820 -42.0
Hawaii Workplace 0.5396454 -46.0
Connecticut Residential 0.5375800 14.0
New York Parks 0.5272475 20.0
Massachusetts Retail_Recreation -0.5200737 -44.0
North Dakota Parks 0.5180255 -34.0
Connecticut Retail_Recreation -0.5171762 -45.0
Arizona Retail_Recreation -0.5063328 -42.5
New Jersey Retail_Recreation -0.5059616 -62.5
Maine Parks 0.5043227 -31.0
Connecticut Workplace -0.4994508 -39.0
Montana Parks -0.4913929 -58.0
Wyoming Workplace -0.4879643 -31.0
Nebraska Residential -0.4851611 14.0
New Jersey Grocery_Pharmacy -0.4842532 2.5
New Mexico Grocery_Pharmacy -0.4809136 -11.0
Rhode Island Parks 0.4729613 52.0
Iowa Parks -0.4714615 28.5
Montana Residential 0.4701424 14.0
New Mexico Parks 0.4502554 -31.5
Illinois Transit -0.4490098 -31.0
Kansas Parks 0.4462524 72.0
New Mexico Residential 0.4459560 13.5
Vermont Residential 0.4368806 11.5
Kentucky Parks -0.4360494 28.5
Pennsylvania Workplace -0.4333501 -36.0
New Jersey Transit -0.4288781 -50.5
California Transit -0.4276659 -42.0
Idaho Workplace -0.4272893 -29.0
South Carolina Workplace 0.4265329 -30.0
Arizona Residential 0.4252443 13.0
Massachusetts Grocery_Pharmacy -0.4210187 -7.0
Wisconsin Transit -0.4186413 -23.5
California Residential 0.4152580 14.0
Montana Retail_Recreation -0.4145442 -51.0
New Hampshire Residential -0.4126391 14.0
Idaho Grocery_Pharmacy -0.4028251 -4.5
Idaho Transit -0.3997869 -30.0
Maryland Workplace -0.3985922 -35.0
Maryland Grocery_Pharmacy -0.3934092 -10.0
Alabama Grocery_Pharmacy -0.3907619 -2.0
Montana Transit -0.3892824 -41.0
Alabama Workplace -0.3888324 -29.0
Arizona Transit 0.3825429 -38.0
Nevada Transit -0.3776545 -20.0
New York Transit -0.3734464 -48.0
West Virginia Parks 0.3664071 -33.0
Wyoming Grocery_Pharmacy -0.3642220 -10.0
Pennsylvania Retail_Recreation -0.3571931 -45.0
New Mexico Retail_Recreation -0.3559498 -42.5
Arkansas Parks -0.3428492 -12.0
Michigan Parks 0.3419459 30.0
Nebraska Grocery_Pharmacy 0.3356713 -0.5
Florida Residential 0.3353193 14.0
Alabama Transit -0.3350366 -36.5
Pennsylvania Parks 0.3335458 13.0
California Parks -0.3304668 -38.5
Montana Grocery_Pharmacy -0.3279939 -16.0
Alaska Retail_Recreation 0.3260634 -39.0
Minnesota Transit -0.3126123 -28.5
Maine Retail_Recreation -0.3082105 -42.0
North Carolina Grocery_Pharmacy 0.3058269 0.0
West Virginia Grocery_Pharmacy -0.3036763 -6.0
Vermont Retail_Recreation 0.3010980 -57.0
Idaho Retail_Recreation -0.2950909 -40.5
Colorado Residential 0.2861895 14.0
North Dakota Workplace 0.2839302 -40.0
Maryland Retail_Recreation -0.2833815 -39.0
Nevada Residential 0.2816997 17.0
Mississippi Residential 0.2805443 13.0
Arkansas Retail_Recreation -0.2779674 -30.0
Texas Residential -0.2760566 15.0
Rhode Island Transit -0.2749771 -56.0
Virginia Transit -0.2722920 -33.0
Vermont Workplace -0.2687562 -43.0
North Carolina Workplace 0.2683601 -31.0
Utah Retail_Recreation -0.2675410 -40.0
Kansas Workplace 0.2669918 -32.5
Oregon Grocery_Pharmacy 0.2640656 -7.0
Maryland Residential 0.2638110 15.0
Nevada Retail_Recreation -0.2630363 -43.0
Texas Workplace 0.2596640 -32.0
Rhode Island Grocery_Pharmacy 0.2590564 -7.5
Illinois Workplace -0.2517388 -31.0
Tennessee Workplace -0.2511445 -31.0
Texas Parks 0.2509584 -42.0
California Grocery_Pharmacy -0.2489340 -11.5
Tennessee Residential 0.2481912 11.5
California Retail_Recreation -0.2476241 -44.0
Wisconsin Parks 0.2462875 51.5
Florida Parks -0.2456155 -43.0
Illinois Parks 0.2437659 26.5
South Carolina Parks -0.2421154 -23.0
Pennsylvania Grocery_Pharmacy -0.2394224 -6.0
Georgia Grocery_Pharmacy -0.2369499 -10.0
New York Grocery_Pharmacy -0.2313220 8.0
Missouri Residential -0.2304278 13.0
Arkansas Residential 0.2302419 12.0
Washington Workplace -0.2247289 -38.0
California Workplace -0.2216927 -36.0
North Carolina Transit 0.2141866 -32.0
Idaho Residential -0.2139764 11.0
Michigan Workplace -0.2129689 -40.0
North Carolina Residential 0.2118888 13.0
New Jersey Residential 0.2101825 18.0
Kansas Grocery_Pharmacy -0.2005186 -14.0
Oregon Residential 0.1959444 10.5
Iowa Transit 0.1926479 -24.0
Missouri Workplace 0.1921238 -28.5
Illinois Residential 0.1918862 14.0
Mississippi Grocery_Pharmacy -0.1877287 -8.0
Georgia Workplace -0.1873048 -33.5
South Dakota Transit -0.1869489 -40.0
Colorado Parks -0.1784104 2.0
North Dakota Grocery_Pharmacy -0.1727360 -8.0
Georgia Retail_Recreation -0.1715501 -41.0
Virginia Grocery_Pharmacy -0.1667542 -8.0
Wisconsin Residential -0.1661594 14.0
Virginia Residential 0.1653950 14.0
Florida Retail_Recreation 0.1640526 -43.0
New Mexico Transit 0.1639043 -38.5
Connecticut Parks 0.1628726 43.0
Ohio Transit 0.1626294 -28.0
Washington Residential 0.1582548 13.0
Georgia Residential -0.1551805 13.0
South Carolina Residential -0.1527366 12.0
Oklahoma Residential 0.1506231 15.0
Virginia Parks 0.1493325 6.0
South Dakota Retail_Recreation -0.1479179 -38.5
Alabama Parks 0.1469934 -1.0
Indiana Retail_Recreation 0.1451846 -38.0
Minnesota Parks 0.1449439 -9.0
New Hampshire Retail_Recreation -0.1437423 -41.0
Mississippi Transit -0.1437061 -38.5
Massachusetts Parks 0.1414052 39.0
North Dakota Transit 0.1409119 -48.0
Indiana Residential 0.1374623 12.0
Michigan Retail_Recreation -0.1369377 -53.0
Massachusetts Transit -0.1354371 -45.0
Washington Grocery_Pharmacy 0.1339896 -7.0
Maine Residential -0.1319279 11.0
Oregon Retail_Recreation 0.1312867 -41.0
North Carolina Parks -0.1308373 7.0
Alabama Retail_Recreation 0.1293122 -39.0
Pennsylvania Transit -0.1289826 -41.5
Wyoming Retail_Recreation -0.1285684 -39.0
Washington Transit -0.1285345 -33.5
South Dakota Residential 0.1274931 15.0
Ohio Parks -0.1236796 67.5
Texas Transit 0.1203137 -41.0
New Hampshire Grocery_Pharmacy -0.1180567 -6.0
Oregon Parks 0.1172530 16.5
Florida Workplace -0.1169718 -33.0
Massachusetts Residential 0.1150486 15.0
Oklahoma Parks -0.1149756 -18.5
Kansas Transit -0.1133419 -26.5
Wisconsin Workplace -0.1128686 -31.0
Mississippi Retail_Recreation -0.1126496 -40.0
Minnesota Workplace -0.1101149 -33.0
Kentucky Grocery_Pharmacy 0.1098535 4.0
Maine Grocery_Pharmacy -0.1092672 -13.0
Arkansas Workplace -0.1088347 -26.0
Idaho Parks 0.1083222 -22.0
Texas Grocery_Pharmacy 0.1066133 -14.0
Arkansas Transit 0.1058598 -27.0
Ohio Residential 0.1053706 14.0
Maryland Transit -0.1053568 -39.0
Indiana Parks -0.1052708 29.0
Arizona Workplace -0.1026300 -35.0
Nebraska Retail_Recreation 0.1018903 -36.0
Wyoming Residential 0.1012112 12.5
Minnesota Retail_Recreation 0.1006920 -40.0
Wisconsin Grocery_Pharmacy 0.1001294 -1.0
Georgia Parks 0.0948271 -6.0
Missouri Transit -0.0916935 -24.5
New York Residential 0.0916145 17.5
Oklahoma Grocery_Pharmacy -0.0913012 -1.0
Mississippi Workplace -0.0901191 -33.0
Virginia Workplace -0.0898596 -31.5
Pennsylvania Residential 0.0891985 15.0
New Hampshire Transit -0.0889069 -57.0
Indiana Workplace 0.0882930 -34.0
West Virginia Residential -0.0871474 11.0
South Dakota Workplace 0.0862154 -35.0
Michigan Grocery_Pharmacy -0.0812744 -11.0
South Carolina Transit 0.0807930 -45.0
Virginia Retail_Recreation -0.0798877 -35.0
Kentucky Transit 0.0794671 -31.0
Nebraska Parks 0.0771428 55.5
Colorado Transit 0.0751106 -36.0
Indiana Grocery_Pharmacy -0.0750564 -5.5
Kentucky Retail_Recreation 0.0747466 -29.0
Michigan Residential 0.0740789 15.0
Tennessee Parks -0.0727419 10.5
Ohio Grocery_Pharmacy 0.0700577 0.0
Nevada Parks 0.0698748 -12.5
Washington Parks 0.0695633 -3.5
Michigan Transit 0.0693831 -46.0
Nebraska Transit -0.0685845 -9.0
North Carolina Retail_Recreation 0.0674547 -34.0
South Carolina Retail_Recreation -0.0659246 -35.0
Minnesota Grocery_Pharmacy 0.0642441 -6.0
Oregon Workplace -0.0630496 -31.0
Ohio Retail_Recreation 0.0617815 -36.0
South Dakota Grocery_Pharmacy 0.0608018 -9.0
West Virginia Workplace 0.0602897 -33.0
West Virginia Retail_Recreation -0.0573394 -38.5
Oklahoma Workplace 0.0567011 -31.0
Washington Retail_Recreation -0.0537001 -42.0
North Dakota Residential -0.0532812 17.0
Iowa Retail_Recreation -0.0518500 -38.0
Oregon Transit 0.0507563 -27.5
South Carolina Grocery_Pharmacy 0.0493990 1.0
New Hampshire Workplace 0.0492952 -37.0
Missouri Parks 0.0488683 0.0
Florida Grocery_Pharmacy 0.0486952 -14.0
Kentucky Residential 0.0473650 12.0
Missouri Grocery_Pharmacy 0.0456989 2.0
Missouri Retail_Recreation -0.0455021 -36.0
Kentucky Workplace -0.0449154 -36.0
Arizona Parks -0.0444982 -44.5
Illinois Grocery_Pharmacy -0.0416868 2.0
Illinois Retail_Recreation 0.0413075 -40.0
West Virginia Transit -0.0400159 -45.0
Florida Transit -0.0381769 -49.0
Texas Retail_Recreation 0.0376074 -40.0
Nevada Workplace 0.0354729 -40.0
Indiana Transit 0.0350104 -29.0
Colorado Grocery_Pharmacy -0.0347538 -17.0
Colorado Retail_Recreation -0.0311133 -44.0
Tennessee Transit -0.0292326 -32.0
Ohio Workplace -0.0285293 -35.0
Tennessee Grocery_Pharmacy 0.0259319 6.0
Minnesota Residential -0.0251435 17.0
Oklahoma Retail_Recreation 0.0244485 -31.0
Utah Grocery_Pharmacy 0.0238056 -4.0
Iowa Workplace -0.0224526 -30.0
Mississippi Parks -0.0213193 -25.0
Wisconsin Retail_Recreation 0.0198518 -44.0
Kansas Residential -0.0174066 13.0
Alabama Residential -0.0135226 11.0
Kansas Retail_Recreation -0.0128805 -37.0
Georgia Transit -0.0126093 -35.0
New Mexico Workplace 0.0115647 -34.0
Nevada Grocery_Pharmacy 0.0109606 -12.5
Iowa Residential -0.0108944 13.0
Maryland Parks -0.0098727 27.0
Vermont Transit 0.0080600 -63.0
Colorado Workplace 0.0064375 -39.0
Oklahoma Transit 0.0062005 -26.0
Tennessee Retail_Recreation -0.0043902 -30.0
Iowa Grocery_Pharmacy 0.0027622 4.0
Arkansas Grocery_Pharmacy 0.0003171 3.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
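A caution when reading the top of this table: a Pearson correlation computed from only two points is always exactly +1 or -1. The rows with |cor| = 1 (Delaware, Alaska Transit) may therefore reflect very few counties with data rather than a real relationship. The sketch below shows one way to guard against this by reporting the number of points alongside the correlation. `guarded_cor`, its `min_n` threshold and the toy data are all hypothetical.

```r
# Hypothetical helper: report the correlation only when enough points support it
guarded_cor <- function(x, y, min_n = 5) {
  ok <- complete.cases(x, y)
  n <- sum(ok)
  r <- if (n >= min_n) cor(x[ok], y[ok]) else NA_real_
  data.frame(n = n, cor = r)
}

# Two points always give |r| = 1, whatever the underlying relationship
two_point_r <- cor(c(-40, -35), c(0.2, 0.5))

# Made-up example: a two-county state (A) and a six-county state (B)
toy <- data.frame(
  state = rep(c("A", "B"), times = c(2, 6)),
  value = c(-40, -35, -50, -45, -40, -35, -30, -25),
  cases_per100 = c(0.2, 0.5, 0.9, 0.7, 0.8, 0.4, 0.5, 0.2)
)

summary_toy <- do.call(rbind, lapply(split(toy, toy$state), function(d) {
  cbind(state = d$state[1], guarded_cor(d$value, d$cases_per100))
}))
summary_toy
```

The threshold of five points is arbitrary. The same idea could be applied inside the `ddply` summary above by adding a count column and filtering on it before ranking.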
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note that this is a manual process: the generation of reports, plots and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trend in COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (but some regions, such as MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).
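As a rough illustration of what "log-linear phase" and "mortality rate" mean here, the sketch below fits a linear model to log-transformed cumulative cases and computes deaths over confirmed cases. The counts are made-up toy values, and the names `day`, `total_cases`, and `deaths` are hypothetical; this is not necessarily the exact model used for the plots in this report.

```r
# Toy cumulative counts for one hypothetical region (NOT real data)
day         <- 0:9
total_cases <- round(100 * exp(0.2 * day))
deaths      <- round(0.03 * total_cases)

# In a log-linear phase, log(cases) grows roughly linearly with time
fit           <- lm(log(total_cases) ~ day)
growth_rate   <- unname(coef(fit)["day"])   # slope on the natural-log scale
doubling_time <- log(2) / growth_rate       # days for cases to double
r_squared     <- summary(fit)$r.squared     # near 1 suggests log-linear growth

# Mortality rate as defined in the text: deaths / confirmed cases
mortality_rate <- deaths[length(deaths)] / total_cases[length(total_cases)]

print(c(growth_rate = growth_rate, doubling_time = doubling_time,
        r_squared = r_squared, mortality_rate = mortality_rate))
```

A region that has left the log-linear phase would show a visibly lower R-squared, or a slope that shrinks when the model is refit on only the most recent days.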

See table below for detailed breakdown.

Question | Answer
What is the effect of social distancing (decreased mobility) on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the best correlations, but the results are mixed.
What is the trend in cases and mortality across geographical regions? | The confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall.
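One simple way to look at the mobility question is a per-state rank correlation between a mobility change and case counts. The sketch below uses invented toy values, and the column names `retail_change` and `new_cases` are hypothetical stand-ins for the Google mobility and case columns; it is meant only to show the shape of the comparison, not to reproduce the results above.

```r
# Toy data for two hypothetical states (NOT real data)
mobility <- data.frame(
  Province.State = rep(c("A", "B"), each = 6),
  retail_change  = c(-10, -20, -30, -40, -45, -50,   # % change from baseline
                     -5, -15, -25, -30, -35, -40),
  new_cases      = c(5, 10, 20, 40, 60, 80,
                     12, 8, 15, 9, 14, 10)
)

# Spearman correlation within each state; values near 0 mean little association
by_state <- sapply(
  split(mobility, mobility$Province.State),
  function(d) cor(d$retail_change, d$new_cases, method = "spearman")
)
print(by_state)
```

Spearman is used here because it only assumes a monotonic relationship; a correlation of either sign says nothing about causation, since testing volume and reporting lag also move case counts.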

END

End: ##—— Tue May 26 20:36:24 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

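# Geographical heatmap!
# install.packages("maps")  # one-time install; commented out so it does not rerun on every knit
library(maps)
library(ggplot2)  # map_data(), ggplot()
library(dplyr)    # %>%, select(), filter()
# County outlines for Pennsylvania (the variable name is a leftover; the data are PA counties)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)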

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # placeholder values; avoids hard-coding the row count
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state,Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
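    geom_point(data = Corona_Cases.US_state, aes(x = Long, y = Lat)) +  # plain data frame with Lat/Long columns, so drawn as points rather than geom_sf()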
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

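# There is no "join" package; sf provides inner_join() methods for dplyr. Assumes the two tables share a key column.
dplyr::inner_join(counties, Corona_Cases.US_state)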

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


# sp expects coordinates as x (longitude) first, then y (latitude)
spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Long","Lat")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf; st_cast() only changes geometry types of sf objects

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)
id
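The sandbox attempts above are all circling one idea: attach case counts to map polygons and fill by count. A minimal sketch of that join is below, assuming the ggplot2 and maps packages are installed. The counts are toy values and `state_cases` is a hypothetical stand-in for a one-date slice of Corona_Cases.US_state.

```r
library(ggplot2)  # map_data() and geom_polygon(); map_data() needs the 'maps' package

# Toy counts (NOT real data); hypothetical stand-in for one date of Corona_Cases.US_state
state_cases <- data.frame(
  Province.State        = c("Pennsylvania", "Maryland", "New York", "New Jersey"),
  Total_confirmed_cases = c(100, 200, 300, 400)
)

# map_data("state") stores state names in lower case in 'region', so build a matching key
state_cases$region <- tolower(state_cases$Province.State)

states_map <- map_data("state")

# Inner join keeps only the states we have counts for
choropleth <- merge(states_map, state_cases, by = "region")

# merge() reorders rows; polygons must be drawn in their original vertex order
choropleth <- choropleth[order(choropleth$group, choropleth$order), ]

p <- ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") +
  coord_quickmap()
print(p)
```

The key point is that geom_polygon() needs polygon outlines (many vertices per state), not a single Lat/Long point per state, which is why plotting Corona_Cases.US_state directly with geom_polygon() does not produce a filled map.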

https://stevenbsmith.net